This case introduces students to a relatively comprehensive walkthrough of an exploratory data analysis (EDA) process on an unfamiliar dataset. We want students to learn how to systematically approach investigating an unknown dataset while maintaining creativity to look for insights. By the end, we hope students are confident in tackling datasets outside of a structured environment and appreciate the tremendous value of a proper EDA process for any data science problem.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium  # needed for the interactive maps
from folium.plugins import HeatMap
from collections import Counter
%matplotlib inline
sns.set()
Business Context. Airbnb is an enormous online marketplace for everyday people to rent places to stay. It is a large and lucrative market, but many vendors are simply individuals who are renting their own primary residence for short visits. Even the larger vendors are typically small businesses with only a handful of places to rent. As a result, vendors have limited ability to assess large-scale trends and set optimal prices.
Airbnb has rolled out a new service to help listers set prices. Airbnb makes a percentage commission off of the listings, so they are incentivized to help listers price optimally; that is, at the maximum possible point where they will still close a deal. You are an Airbnb consultant helping with this new pricing service.
Business Problem. Your initial task is to explore the data with the goal of answering the question: "What features are most relevant to the price of an Airbnb listing?"
Analytical Context. We will use the publicly available and well-maintained dataset created by the Inside Airbnb advocacy group. We will focus on listings in New York City within the last year, taking advantage of larger datasets when there are important details to explore.
The case is structured as follows: we will (1) do basic data exploration by plotting distributions of key quantities; (2) introduce the concept of correlation to find the key features; (3) introduce the idea of interaction effects to correct for the effects of key features; (4) discuss how to iteratively generate hypotheses and choose data visualizations to support your conclusions; (5) look at one very specific type of interaction effect, the temporal effect, and how to correct for it; and finally (6) pull everything together to identify the key factors that affect the price.
We begin by loading the data and looking at its basic shape:
listings = pd.read_csv('airbnb_nyc.csv', delimiter=',')
listings.shape
Let's also look at the columns of the dataset:
listings.columns
# We display the basic listings data.
pd.options.display.max_columns = 100
listings.head(3)
The following are details about some of the important columns here:
- neighbourhood: which neighborhood the property is in
- longitude, latitude: longitude and latitude of the property
- property_type: type of property, such as apartment, condo, etc.
- bathrooms: number of bathrooms
- bedrooms: number of bedrooms
- price: price of the listing
- number_of_reviews: number of reviews given by customers who stayed there
- parking: 1 means there is parking available, -1 means there is not

For other categorical variables, such as outdoor_friendly, gym, etc., the 1/-1 values should be interpreted the same way as for parking above.
As you have seen in the Python cases, it is good to first develop an idea of how the values of a few key quantities of interest are distributed. We always start investigating by gaining an overhead view of various parameters in our data. Let's start by doing so for some numeric variables, such as price, bedrooms, bathrooms, number_of_reviews:
Use the describe() and quantile() commands to compute some important summary statistics for the above variables.
Answer. One possible solution is given below:
listings[['price','bedrooms','bathrooms','number_of_reviews']].describe()
listings['price'].quantile([0.9,0.95,0.99])
listings['number_of_reviews'].describe()
Use the plt.hist() function to plot the histogram of the above variables. What are their basic shapes (e.g. normal, skewed, multi-modal, etc.)?
Answer. All look somewhat skewed to the right, though the bathroom variable is so concentrated at a single entry that it is hard to tell.
plt.figure(figsize=(12,10))
vars_to_plot = ['price', 'bedrooms','bathrooms','number_of_reviews']
for i, var in enumerate(vars_to_plot):
    plt.subplot(2, 2, i + 1)
    plt.hist(listings[var], 50)
    plt.title("Histogram of " + var)
Are the distributions fairly smooth, or do they exhibit "spiky" or "discontinuous" behavior? If the latter, can you explain where it might come from?
Answer. The price variable is noticeably spiky. There is a nice bulk of prices between about 25 and 300 dollars, with very obvious spikes at nice, round numbers such as 50, 100, 150, 200, 250, and 300. This probably reflects the fact that people enter in the prices that they wish to list at, and so tend to choose round numbers (or numbers just below round numbers).
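A quick way to confirm this round-number clustering numerically is to check what share of prices fall on exact multiples of 50. The sketch below uses synthetic prices as a stand-in for the real listings data (the mix of round and non-round prices is an invented assumption):

```python
import numpy as np
import pandas as pd

# Synthetic prices: most hosts pick round numbers, a minority do not.
rng = np.random.default_rng(0)
round_prices = rng.choice([50, 100, 150, 200, 250, 300], size=900)
other_prices = rng.integers(25, 320, size=100)
prices = pd.Series(np.concatenate([round_prices, other_prices]))

# Share of listings priced at an exact multiple of 50: a numeric
# check for the "spiky" behavior visible in the histogram.
round_share = (prices % 50 == 0).mean()
print(f"Share of prices at multiples of 50: {round_share:.2f}")
```

On the real data, a share far above what a smooth distribution would produce is evidence that hosts are choosing prices by hand.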
Can you detect any outliers from these histograms? If so, do they suggest (i) data error; or (ii) data that should be omitted from our future analysis?
Answer. Very few places had prices of more than $320, and so we might think of these as "outliers". Some of these may represent error, but we guess that most of them are correct – hotels in NYC certainly often go for over 400 dollars per night, and so it is not unreasonable to expect some Airbnb listings of this price. The question as to whether we should omit these outliers is a little more difficult, but we lean towards omitting them for most clients. Even if these prices are correct, we suspect that they are governed by idiosyncratic factors that are not as relevant to the listings that most of our clients are interested in analyzing. Thus, they will tend to give us misleading (or "biased") results.
It is important to note that we never omit data due to convenience or for arbitrary reasons. We only omit data when there is a logical reason that including such data would ruin our analysis.
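For instance, a documented and reproducible trimming rule such as "drop listings above the 99th price percentile" satisfies this standard, because anyone can rerun it and see exactly what was removed. The sketch below applies the rule to synthetic data (the numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic listings table standing in for airbnb_nyc.csv:
# 990 ordinary prices plus a few luxury outliers.
rng = np.random.default_rng(1)
listings_demo = pd.DataFrame({
    "price": np.concatenate([rng.normal(150, 60, 990).clip(25),
                             rng.uniform(1000, 5000, 10)])
})

# Trim above the 99th percentile: a stated, reproducible rule
# rather than an arbitrary cut.
cutoff = listings_demo["price"].quantile(0.99)
trimmed = listings_demo[listings_demo["price"] <= cutoff]
print(f"Cutoff: {cutoff:.0f}, kept {len(trimmed)} of {len(listings_demo)} rows")
```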
Sometimes, it is better to look at a histogram which plots the relative percentages of values across categories:
# How many bedrooms
bedrooms_counts = Counter(listings.bedrooms)
tdf = pd.DataFrame.from_dict(bedrooms_counts, orient = 'index').sort_values(by = 0)
tdf = (tdf.iloc[-10:, :] / len(listings)) * 100
# Sort bedroom dataframe by number
tdf.sort_index(axis = 0, ascending = True, inplace = True)
# Plot percent of listings by bedroom number
ax = tdf.plot(kind = 'bar', figsize = (12, 7.5))
ax.set_xlabel("# Bedrooms")
ax.set_ylabel("% Listings")
ax.set_title('% Listings by Bedrooms')
ax.legend_.remove()
plt.show()
print("Percent of 1 Bedroom Listings: {0:.2f}%".format(tdf[0][1]))
# The format spec {0:.2f} prints the value to two decimal places;
# change it to {0:.3f} to see what happens.
Now that we have looked at the variables of interest in isolation, it makes sense to look at them in relation to price. Use sns.boxplot() to plot price against each of bedrooms, bathrooms, number_of_reviews, and review_scores_cleanliness.
Answer. One possible solution is given below:
plt.figure(figsize=(12,10))
vars_to_plot = ['bedrooms','bathrooms','number_of_reviews','review_scores_cleanliness']
for i, var in enumerate(vars_to_plot):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x=var, y='price', data=listings)
    plt.ylabel("Price")
    plt.title("Boxplot of Price vs. " + var)
Comment on the relationship between price and the respective variable in each of the above plots.
Answer.
Although plotting the relationship between price and a few other variables is a good first step, overall there are too many variables to individually plot and manually inspect. We need a more systematic method. How do we proceed? An easy way to get a quick overview of the key variables that affect the price is via correlation.
Let's look at the price vs. bedrooms plot again:
plt.figure(figsize=(10,8))
plt.subplot(121)
sns.scatterplot(x='bedrooms',y = 'price', data = listings)
plt.ylabel("price")
plt.title("Scatterplot of Price vs. Bedrooms")
plt.subplot(122)
sns.boxplot(x='bedrooms',y= 'price', data = listings)
plt.ylabel("price")
plt.title("Boxplot of Price vs. Bedrooms")
We see that as the number of bedrooms increases, the price on average increases. Correlation is one quantity that captures this relationship. The correlation of two quantities is a measurement of how much they tend to increase together, measured on a scale from -1 to 1. A positive correlation between price and number of bedrooms would indicate that higher-priced listings tend to have more bedrooms. Similarly, a negative correlation between price and number of bedrooms would indicate that higher-priced listings tend to have fewer bedrooms. In our case, we can easily see that price is positively correlated with bedrooms.
Since correlation is just a single number summarizing an entire joint distribution, it can be misleading and does not eliminate the need to plot and visually inspect the key variables that it suggests are important. Nonetheless, it is quite helpful when quickly scanning for very strong relationships in the data and whittling down a much larger list of potential factors.
np.corrcoef(listings['price'],listings['bedrooms'])[0,1]
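To demystify what np.corrcoef computes, the sketch below evaluates the Pearson correlation directly from its definition on a small invented dataset and checks that the two agree:

```python
import numpy as np

# Small invented bedrooms/price arrays for illustration.
bedrooms = np.array([1, 1, 2, 2, 3, 3, 4])
price = np.array([80, 95, 120, 150, 200, 210, 260])

# Pearson correlation = covariance / (product of standard deviations).
cov = np.mean((bedrooms - bedrooms.mean()) * (price - price.mean()))
r_manual = cov / (bedrooms.std() * price.std())
r_numpy = np.corrcoef(bedrooms, price)[0, 1]
print(round(r_manual, 6), round(r_numpy, 6))
```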
The correlation matrix then gives all of the pairwise correlations between all of the variables. We can get a quick overview of the key variables that affect the price by looking at its row in the correlation matrix.
Write code to compute the correlation matrix between the price and other quantities. (use .corr() function).
Print the columns which are positively correlated, in increasing order of the correlation.
Print the columns which are negatively correlated, in increasing order of the magnitude of the correlation.
Answer. One possible solution to all three parts above is shown below:
# Create the correlation matrix. numeric_only=True restricts the computation
# to numeric columns, which newer versions of pandas require to be explicit.
corr = listings.corr(numeric_only=True)
pos_cor = corr['price'] > 0
neg_cor = corr['price'] < 0
# Columns positively correlated with price, strongest first:
corr['price'][pos_cor].sort_values(ascending=False)
# Columns negatively correlated with price, largest magnitude first:
corr['price'][neg_cor].sort_values()
From the table above, what factors are most correlated with price? Which correlations are surprising?
Answer. Many of these are unsurprising – for example, the largest correlations are with measures of size (accommodates, bedrooms, beds, etc.). Review scores are only slightly related to price. Looking at the location-related scores, we find that longitude is negatively related to price while latitude is not. This motivates us to plot them on a map (we will do this next).
We also notice a few correlations that seem a bit surprising; for example, parking is negatively correlated with price.
In Exercise 3, we found quite a few variables that are reasonably correlated with price. We could continue our exploration by looking at each of these variables in turn, but we know that in real estate, location data is quite special, and so we will first explore how location affects the price in greater detail.
We will use the folium package. Make sure that you have installed the package (if not, do it now!). The following gives an interactive map for plotting the listings on a map of New York City:
folium_map = folium.Map(location=[40.738, -73.98],
zoom_start=13,
tiles="OpenStreetMap")
folium_map
# This sets up a basic map of NYC. You can try changing the "tiles" option above.
# The options are: "OpenStreetMap", "Mapbox Bright", "Stamen Toner", "Mapbox Control Room", "Stamen Terrain"
folium_map = folium.Map(location=[40.738, -73.98],
zoom_start=13,
tiles="OpenStreetMap")
# Now we can put a scatter plot of the first 1000 data points on the above map
for i in range(1000):
    marker = folium.CircleMarker(
        location=[listings["latitude"][i], listings["longitude"][i]],
        radius=5, color="red", fill=True)  # "red" is a valid CSS color name; "r" is not
    marker.add_to(folium_map)
folium_map
Next, we create a heatmap of the price of apartments in NYC. This will give us a sense of where the important locations are:
max_amount = float(listings['price'].max())
folium_hmap = folium.Map(location=[40.738, -73.98],
zoom_start=13,
tiles="OpenStreetMap")
hm_wide = HeatMap(
    list(zip(listings['latitude'], listings['longitude'], listings['price'])),
    min_opacity=0.2,
    max_val=max_amount,
    radius=8, blur=6,
    max_zoom=15,
)
folium_hmap.add_child(hm_wide)
Looking at this map, you can (roughly) see the correlation between price and longitude/latitude. Does location appear to be strongly related to price? Does it seem likely that this relationship could be captured well by a linear model?
Answer. The location does seem to be strongly related to price; however, the relationship doesn't seem to be close to linear. This suggests that, when we model price, we will need to incorporate location data and our method cannot be via a linear mapping.
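One simple way to let a model capture such a non-linear location effect is to bin latitude and longitude into a coarse grid and summarize price per cell. The sketch below does this on synthetic coordinates and prices (the price bump near midtown Manhattan is an invented assumption for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
lat = rng.uniform(40.60, 40.85, n)
lon = rng.uniform(-74.05, -73.80, n)
# Invented price surface: prices peak near (40.75, -73.98) and decay with distance.
dist = np.hypot(lat - 40.75, lon + 73.98)
price = 300 * np.exp(-dist / 0.05) + rng.normal(0, 20, n) + 60

df = pd.DataFrame({"lat": lat, "lon": lon, "price": price})
# Bin each coordinate into 10 intervals, giving a 10x10 grid of cells.
df["lat_bin"] = pd.cut(df["lat"], bins=10)
df["lon_bin"] = pd.cut(df["lon"], bins=10)
cell_means = df.groupby(["lat_bin", "lon_bin"], observed=True)["price"].mean()
print(cell_means.sort_values(ascending=False).head())
```

A grid of cell-level averages like this is a piecewise-constant approximation of the price surface, which a linear function of raw latitude and longitude cannot represent.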
Write code to make a scatterplot between price and longitude, with number of bedrooms categorized by color.
Answer. We can also plot longitude vs. price:
sns.scatterplot(x= listings['longitude'], y = listings['price'], hue = listings['bedrooms'])
When looking at the list of correlations, parking stood out as having a surprisingly negative correlation with price. We've seen that location has a strong influence on price; let's see if it can help explain the negative correlation exhibited by parking.
Write code here to plot the first 1000 locations on the map where parking is available by blue color, and the first 1000 locations where parking is not available by red color. (Hint: You can use the command: color = "blue" and "red" respectively.)
Answer. One possible solution is given below:
lat_log_parking_yes = listings.loc[ listings['parking']==1.0, ["latitude","longitude" ] ]
lat_log_parking_no = listings.loc[ listings['parking']==-1.0, ["latitude","longitude" ] ]
folium_map = folium.Map(location=[40.738, -73.98],
zoom_start=13,
tiles="OpenStreetMap")
for i in range(1000):
    marker = folium.CircleMarker(
        location=[lat_log_parking_yes["latitude"].iloc[i],
                  lat_log_parking_yes["longitude"].iloc[i]],
        radius=5, color="blue", fill=True)
    marker.add_to(folium_map)
for i in range(1000):
    marker = folium.CircleMarker(
        location=[lat_log_parking_no["latitude"].iloc[i],
                  lat_log_parking_no["longitude"].iloc[i]],
        radius=5, color="red", fill=True)
    marker.add_to(folium_map)
folium_map
Now that we have explored some of the factors that are expected to affect price, let's focus on understanding the unexpected correlations, such as the negative correlation with parking. We start with the latter:
# First, plot parking vs. non-parking prices.
sns.kdeplot(listings.loc[listings['parking'] == 1, 'price'], fill=True, label="Parking", color="g")
sns.kdeplot(listings.loc[listings['parking'] == -1, 'price'], fill=True, label="No Parking", color="r")
plt.legend()
plt.title("Density plot of Price for Parking vs. No Parking");
We saw before that the correlation between price and parking is -0.019383. Since parking is desirable, we expect the price to increase with parking. When we see a pattern like this, we should suspect the existence of interaction effects that are complicating the parking vs. price relationship. Interaction effects are when the relationship between two variables is conditional, or depends on the value of a third, hidden variable.
We strongly recommend that you lean on your domain expertise when looking at data and really explore correlation/relationships that appear funny or weird or counterintuitive. We may find interaction effects or we may find that certain long-held assumptions from our domain expertise are not correct (e.g. maybe parking is not desirable in certain parts of the world).
So, what could this third variable potentially be? Well, we have seen that location has a huge impact on prices. Perhaps high-price areas don't have many parking spots, whereas low-price areas do? We don't know this for sure, but it's a worthwhile guess.
More formally, we hypothesize that this observed negative correlation is the result of interaction effects arising from location. In order to investigate this hypothesis, we ought to break down the locations by neighborhood and see if this negative correlation between price and parking still holds within neighborhoods. The neighborhoods are discrete and there are many listings per neighborhood, so we can simply compute the correlation for every neighborhood individually. Mathematically, this is exactly the same thing as conditioning on the neighborhood and computing the conditional correlation.
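Before computing these conditional correlations on the real data, it may help to see on a tiny synthetic example how conditioning can flip the sign of a correlation (a form of Simpson's paradox). All numbers below are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Two invented neighborhoods: an expensive one where parking is rare,
# and a cheap one where parking is common.
rows = []
for name, base_price, p_parking, n in [("Pricey", 250, 0.1, 500),
                                       ("Budget", 80, 0.8, 500)]:
    parking = rng.random(n) < p_parking
    # Within each neighborhood, parking adds a genuine premium.
    price = base_price + 20 * parking + rng.normal(0, 15, n)
    rows.append(pd.DataFrame({"neighbourhood": name,
                              "parking": np.where(parking, 1, -1),
                              "price": price}))
df = pd.concat(rows, ignore_index=True)

overall = df["price"].corr(df["parking"])
within = df.groupby("neighbourhood").apply(
    lambda g: g["price"].corr(g["parking"]))
print("overall:", round(overall, 3))
print(within.round(3))
```

The pooled correlation is negative because parking is concentrated in the cheap neighborhood, even though parking raises prices within both neighborhoods, which is exactly the pattern we are about to test for in the listings data.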
neighbourhoods = listings.neighbourhood.unique()
cvec = dict()
for x in neighbourhoods:
    temp = listings[listings['neighbourhood'] == x]
    cvec[x] = temp.corr(numeric_only=True)['price']['parking']
# Drop neighborhoods where the correlation is undefined (NaN).
res = [c for c in cvec.values() if str(c) != 'nan']
res.sort()
plt.hist(res, bins=20)
plt.xlabel('Correlation')
plt.ylabel('Count')
plt.show()
print('Average correlation: ', sum(res) / len(res))
Explain the relationship between the histogram and our finding that parking is negatively correlated with price.
Answer. Our original correlation of about $-0.02$ was the correlation between price and parking for all listings in NYC; that is, the conditional correlation between price and parking given that you are in NYC. The number cvec['Brooklyn'] is the correlation between price and parking for all listings in Brooklyn; that is, the conditional correlation between price and parking given that you are in Brooklyn.
The histogram shows us that most of the conditional correlations within neighborhoods are positive, even though the correlation across all of NYC is negative. Roughly speaking, within most neighborhoods parking commands a modest price premium, but because expensive areas tend to lack parking while cheaper areas tend to have it, the citywide correlation comes out negative.
The correlation values of 1 and -1 are presumably due largely to neighborhoods with very few listings, and should essentially be ignored. Viewing the histogram, however, we can see that a clear majority of correlations are at least slightly positive, for an average correlation of 0.08.
Plot the histogram that overlays the distribution of price for parking and non-parking (use sns.kdeplot) for the neighborhoods: St. George, Greenwood Heights, Rego Park, Brooklyn Navy Yard.
If we plot this by neighborhood for a few neighborhoods, we can see this somewhat positive correlation of parking vs. no parking visually:
plt.figure(figsize=(12,10))
neigh_to_look = ['St. George','Greenwood Heights','Rego Park','Brooklyn Navy Yard']
for i, neigh in enumerate(neigh_to_look):
    plt.subplot(2, 2, i + 1)
    sns.kdeplot(listings.loc[(listings['parking'] == 1) & (listings['neighbourhood'] == neigh), 'price'],
                fill=True, label="Parking", color="g")
    sns.kdeplot(listings.loc[(listings['parking'] == -1) & (listings['neighbourhood'] == neigh), 'price'],
                fill=True, label="No Parking", color="r")
    plt.legend()
    plt.title("Parking vs. No Parking for neighbourhood = " + str(neigh))
As we have seen, the existence of unexpected correlations should spur investigation into potential interaction effects, which lead to potentially interesting hypotheses. Thus, one good way of generating iterative hypotheses is to find and think about potential interaction effects.
We saw that finding conditional correlations or interactions is a good way to generate further hypotheses, as many interesting lines of investigation arise from investigating these confounding variables. Here is another example: let's now look at how price varies with property type. The following code plots the price of a one bedroom listing broken down by the property type:
plt.figure(figsize=(12,10))
sns.boxplot(y=listings.loc[listings['bedrooms']==1,'price'], x= listings.loc[listings['bedrooms']==1,'property_type'])
plt.xticks(rotation = 90)
What can you conclude about the variation in price of a one bedroom by the property type?
Answer. There is significant variation in price according to the property type; a room in a house or a loft is the cheapest, while cabins, boutique hotels, and boats are very expensive! It is also interesting to see huge variations in hotel prices.
Do the same price vs. property type plot for two bedroom listings.
Answer. One possible solution is given below:
plt.figure(figsize=(12,10))
sns.boxplot(y=listings.loc[listings['bedrooms']==2,'price'], x= listings.loc[listings['bedrooms']==2,'property_type'])
plt.xticks(rotation = 90)
Pick any other variable of your choice and make another plot showing the variation of price broken down by the sub-categories of that variable.
So far, we have primarily been interested in understanding what factors influence the price of an existing listing. However, a natural question to ask is what you can do to increase the price of your own listing without spending too much money?
Looking through the list of correlations, two immediately stand out:
| Object | Correlation |
|---|---|
| AC | 0.18 |
| TV | 0.26 |
Air conditioning is relatively cheap to install and has a sizable correlation of 0.18 with price. Televisions are very cheap and have an even larger correlation of 0.26. This suggests that one of the best things a host can do for a listing is run out and buy a TV if it doesn't have one; a 300 dollar TV might increase the value of a listing by 30 dollars per night!
Do you really believe this could work? Can you come up with any simple plot or explanation that would argue one way or the other?
Apartments with TVs are systematically better than those without – indeed, presence of a television is positively correlated with many other positive price signals, from the size of the listing (e.g. number of people it accommodates) to essentially all other amenities. Nevertheless, it is important to not quickly conclude everything simply based on correlations. We will discuss this much more carefully in a later class on causal inference.
We have seen that conditional plots can be a useful way to "correct" comparisons by taking into account interaction effects.
Time is a very common interaction effect that appears across lots of datasets. For Airbnb data, this is especially important, as Airbnb is often more expensive near holidays, and so reasonable price estimates must take this into account. In practice this is one of the most important corrections offered by Airbnb pricing consultancy firms, and corrections usually take advantage of data pooled from many somewhat similar cities. This is vital to achieving good corrections, but it is easy to make mistakes by failing to account for important city-to-city differences.
We begin by opening up the calendar data and counting (i) the number of rentals per day; and (ii) their total prices:
cal = pd.read_csv('scal.csv', delimiter=',')
cal.head()
# Count rentals and total price on each date.
rcount = dict()
rprice = dict()
for row in cal.itertuples(index=True, name='Pandas'):
    rcount[str(row[1])] = rcount.get(str(row[1]), 0) + 1
    rprice[str(row[1])] = rprice.get(str(row[1]), 0) + row[2]
rcount
rprice
# Next, plot the results.
tempcount = sorted(rcount.items())
x, y = zip(*tempcount)
tempprice = sorted(rprice.items())
u,v = zip(*tempprice)
# plt.plot(x, y)
# plt.show()
# Next, we look at average price
tempprice = sorted(rprice.items())
u,v = zip(*tempprice)
ratio = lambda a,b: float(a)/float(b)
avgprice = list(map(ratio, v,y))
xd = pd.to_datetime(x)
plt.figure(figsize=(12,10))
plt.plot(xd,avgprice)
plt.xticks(rotation = 'vertical')
plt.ylabel('Average price')
plt.title("Average price vs. Date")
plt.show()
#Let us also plot a smaller time interval
plt.figure(figsize=(12,10))
plt.plot(xd[0:28],avgprice[0:28])
plt.xticks(rotation = 'vertical')
plt.ylabel('Average price')
plt.title("Average price vs. Date")
plt.show()
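As an aside, the per-date counting loop above can be written more compactly with a pandas groupby. The sketch below runs on a synthetic calendar; the column names date and price are assumptions about the layout of scal.csv, which the loop accesses only positionally:

```python
import pandas as pd

# Synthetic stand-in for scal.csv.
cal_demo = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-02"],
    "price": [100.0, 140.0, 90.0, 110.0, 130.0],
})

# One groupby replaces both counting dictionaries.
daily = cal_demo.groupby("date")["price"].agg(count="size", total="sum")
daily["avg_price"] = daily["total"] / daily["count"]
print(daily)
```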
When analyzing time series data like this, it is common to view it as a sum of several contributing effects over time plus noise. The two most common components in such a decomposition are a trend (a slow, long-run drift in the average level) and a seasonal component (a pattern that repeats on a fixed cycle, such as weekly or yearly).
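As a small illustration, the synthetic series below combines a linear trend, a weekly "weekend premium", and noise; a centered rolling mean over one full cycle (7 days) averages the seasonal part out and recovers the trend. All numbers are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
days = np.arange(365)
trend = 100 + 0.05 * days           # slow upward drift over the year
seasonal = 10 * (days % 7 >= 5)     # weekend premium on days 5 and 6
noise = rng.normal(0, 2, days.size)
series = pd.Series(trend + seasonal + noise)

# A centered 7-day window covers each day of the week exactly once, so the
# seasonal component contributes a constant and the rolling mean tracks the trend.
trend_est = series.rolling(7, center=True).mean()
seasonal_plus_noise = series - trend_est
print(round(trend_est.dropna().iloc[0], 1), round(trend_est.dropna().iloc[-1], 1))
```

Subtracting the estimated trend from the series leaves the seasonal pattern plus noise, which is the component a holiday-aware pricing model would then fit.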
Visually, can you see any strong seasonal or trend components? What do they mean?
Answer.
What is the enormous spike that you see in this chart? Is it real, and how would you describe what is going on in layman's terms?
Answer. This spike occurs at Christmas, the busiest holiday season. We expect it every year and must incorporate it in any reasonable model.
Can we guess the busiest season (excluding Christmas) from this raw chart?
Answer. This would be difficult. Notice that this chart covers about a year, but there is a clear discontinuity if you try to "wrap" the data (i.e. the difference between the first and last day on this chart is significant). This is caused by an underlying trend of increasing prices every year. To figure out the best season, you would need to extract out this trend, which is difficult to do from a single year's data in a single city.
This brings us to an important topic: bringing in auxiliary datasets! The Inside Airbnb website includes calendar data for many cities, and we can use these to adjust for the trend component. To get some diversity, we should make sure to source some data from: (i) a city close to NYC; (ii) a city in the US with very different weather; and (iii) some cities very far away.
In this case, we saw that Airbnb prices are influenced by many factors. Some of the main ones include location, date, number of bedrooms, number of guests, and property type.
Any future model we build should feature these factors. Incorporating some of these factors, such as the number of bedrooms, should be straightforward, as this has a large and nearly linear relationship to price. But others, such as location, exhibit very non-linear relationships. We will learn how to deal with these types of complex relationships in future cases.
We also found some surprising correlations, such as the negative correlation between price and parking. However, after breaking the data down by neighborhoods and incorporating the interaction effect of location, this negative correlation went away entirely.
Temporal effects are a very specific type of interaction effect which must be dealt with separately. Our exploration tells us that any model of Airbnb pricing should take into account strong seasonal components as well as strong spikes around major holidays.
In this case, you learned a systematic exploration process: plotting the distributions of key quantities, scanning correlations for strong relationships, investigating surprising correlations for interaction effects, and correcting for temporal structure.
This process can be a bit daunting at first, but it is widely used by veteran data analysts and scientists and is extremely effective in most situations.
We assume this walkthrough was new to many of you, so there is a large amount of learning to digest here. We strongly recommend that you redo/review this case again by yourself until each step seems logical. It is crucial that you feel comfortable with this EDA process as we move towards later cases. By iteratively generating hypotheses throughout this process and investigating them, you can uncover great insight about what is going on without building a single formal model. Formal modeling will be discussed in future cases.